In the first exercise we installed the required software and connected to the course's online environment. I had trouble getting RStudio and Git to talk to each other, and because of this I had to redo the installations several times. After these problems I got to practice in the DataCamp environment, but for some reason the platform renders much smaller than normal, and I have not solved this yet.
The R exercises covered quite basic R usage, but near the end they dove into functions very quickly. In my opinion a few more video lectures on the topic would have been useful. In any case, a very interesting concept. I have not used Git or Markdown before, so it is good to get experience with these as well.
Data import
data <- read.csv("learing_2019.csv")
#Summary
summary(data)
## X gender Age Attitude Points
## Min. : 1.00 F:110 Min. :17.00 Min. :14.00 Min. : 7.00
## 1st Qu.: 42.25 M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00
## Median : 83.50 Median :22.00 Median :32.00 Median :23.00
## Mean : 83.50 Mean :25.51 Mean :31.43 Mean :22.72
## 3rd Qu.:124.75 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75
## Max. :166.00 Max. :55.00 Max. :50.00 Max. :33.00
## Deep Stra Surf
## Min. :1.583 Min. :1.250 Min. :1.583
## 1st Qu.:3.333 1st Qu.:2.625 1st Qu.:2.417
## Median :3.667 Median :3.188 Median :2.833
## Mean :3.680 Mean :3.121 Mean :2.787
## 3rd Qu.:4.083 3rd Qu.:3.625 3rd Qu.:3.167
## Max. :4.917 Max. :5.000 Max. :4.333
#Access ggplot and GGally
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
#Creating graphics
p1 <- ggplot(data, aes(x = Attitude, y = Points, col = gender))
#and fitting linear model
p2 <- p1 + geom_point() + geom_smooth(method = "lm")
p2
In these graphs we can see that gender plays a role. The biggest difference between the genders is in the attitude variable, where females have a clearly lower mean. The distributions of the attitude and surf variables also differ between males and females.
#Correlations
p <- ggpairs(data, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
## Regression model with multiple explanatory variables
my_model <- lm(Points ~ Attitude + Stra + Surf, data = data)
# Summary of the model
summary(my_model)
##
## Call:
## lm(formula = Points ~ Attitude + Stra + Surf, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## Stra 0.85313 0.54159 1.575 0.11716
## Surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Only attitude is significant in this fitted model. The multiple R-squared is about 0.21, so approximately 21% of the variation in Points is explained by the model; since attitude is the only significant predictor, most of this comes from the relationship between Attitude and Points.
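As a sanity check (a small sketch, not part of the original analysis), the multiple R-squared can be computed by hand from the residuals of the model fitted above:

```r
# Hedged sketch: verify Multiple R-squared as 1 - SSE/SST.
# Assumes `my_model` and `data` from the chunks above.
sse <- sum(residuals(my_model)^2)                # residual sum of squares
sst <- sum((data$Points - mean(data$Points))^2)  # total sum of squares
1 - sse / sst  # should match the Multiple R-squared reported by summary(my_model)
```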
# drawing diagnostic plots using the plot() function. Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.
par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))
The assumptions of the model are that the error terms are approximately normally distributed with mean 0 and constant variance, and that they are uncorrelated and independent of the explanatory variables. In particular, the size of the error should not depend on the values of the explanatory variables or of the response.
From these plots the model assumptions seem approximately correct, with the small exception that small and large values of Points show somewhat larger deviations from the estimated mean. A couple of observations also have somewhat large leverage, but overall the assumptions of the model seem to hold quite well. The ends of the range are questionable, and additional analysis would be needed there.
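To complement the graphical QQ-plot with a numeric check, one option (a sketch, not required by the exercise) is a Shapiro-Wilk test on the model residuals:

```r
# Hedged sketch: Shapiro-Wilk test of normality on the residuals of `my_model`.
# A large p-value is consistent with the normality assumption.
shapiro.test(residuals(my_model))
```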
Mette Nissen 15.9.2019 Exercise 3, analysis with student alcohol consumption.
Exercise 3
alc <- read.csv("alc.csv", header = TRUE, sep = ",")
library(ggplot2)
library(GGally)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 2.1.3 ✔ purrr 0.3.2
## ✔ tidyr 0.8.3 ✔ dplyr 0.8.3
## ✔ readr 1.3.1 ✔ stringr 1.4.0
## ✔ tibble 2.1.3 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(knitr)
I chose the variables gender, number of school absences, going out with friends, and final grade for the analysis. My hypothesis is that students who perform poorly in school and miss classes are at a higher risk of heavier alcohol use.
I also looked at the summary, which shows that the mean age is 17 years. The bar chart shows that high use is more common in men. The next boxplot shows that in both genders the high-use students go out with friends more, and their final grade also seems to be lower. Plotting absences against age, the last figure shows that absences seem to be higher in older men with high alcohol consumption.
The same can be seen from the mean values. The group has 198 females and 184 males. Splitting by high use, the non-high-use vs high-use counts in females are 156 and 42 respectively, while the same numbers in men are 112 and 72, so a higher proportion of men are high users. High use does not affect the final grade in females, but there is a clear difference for male students: 12.2 in the non-high-use group versus 10.3 in the high-use group. Absences are higher with high use in both genders, and the same holds for going out with friends (see the mean_absences and mean_goout tables).
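The gender proportions mentioned above can also be computed directly (a small dplyr sketch, assuming `alc` is loaded as above; `high_use` is logical, so its mean is the share of high users):

```r
# Hedged sketch: proportion of high users within each gender.
alc %>% group_by(sex) %>% summarise(high_use_share = mean(high_use))
```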
#Summary of the datatable
kable(summary(alc), digits = 2)
| X | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | nursery | internet | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | higher | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 | alc_use | high_use | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.00 | GP:342 | F:198 | Min. :15.00 | R: 81 | GT3:278 | A: 38 | Min. :0.000 | Min. :0.000 | at_home : 53 | at_home : 16 | course :140 | no : 72 | no : 58 | father: 91 | Min. :1.000 | Min. :1.000 | Min. :0.0000 | no :331 | no :144 | no :205 | no :181 | no : 18 | no :261 | Min. :1.000 | Min. :1.00 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. : 0.0 | Min. : 2.00 | Min. : 4.00 | Min. : 0.00 | Min. :1.000 | Mode :logical | |
| 1st Qu.: 96.25 | MS: 40 | M:184 | 1st Qu.:16.00 | U:301 | LE3:104 | T:344 | 1st Qu.:2.000 | 1st Qu.:2.000 | health : 33 | health : 17 | home :110 | yes:310 | yes:324 | mother:275 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:0.0000 | yes: 51 | yes:238 | yes:177 | yes:201 | yes:364 | yes:121 | 1st Qu.:4.000 | 1st Qu.:3.00 | 1st Qu.:2.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:3.000 | 1st Qu.: 1.0 | 1st Qu.:10.00 | 1st Qu.:10.00 | 1st Qu.:10.00 | 1st Qu.:1.000 | FALSE:268 | |
| Median :191.50 | NA | NA | Median :17.00 | NA | NA | NA | Median :3.000 | Median :3.000 | other :138 | other :211 | other : 34 | NA | NA | other : 16 | Median :1.000 | Median :2.000 | Median :0.0000 | NA | NA | NA | NA | NA | NA | Median :4.000 | Median :3.00 | Median :3.000 | Median :1.000 | Median :2.000 | Median :4.000 | Median : 3.0 | Median :12.00 | Median :12.00 | Median :12.00 | Median :1.500 | TRUE :114 | |
| Mean :191.50 | NA | NA | Mean :16.59 | NA | NA | NA | Mean :2.806 | Mean :2.565 | services: 96 | services:107 | reputation: 98 | NA | NA | NA | Mean :1.448 | Mean :2.037 | Mean :0.2016 | NA | NA | NA | NA | NA | NA | Mean :3.937 | Mean :3.22 | Mean :3.113 | Mean :1.482 | Mean :2.296 | Mean :3.573 | Mean : 4.5 | Mean :11.49 | Mean :11.47 | Mean :11.46 | Mean :1.889 | NA | |
| 3rd Qu.:286.75 | NA | NA | 3rd Qu.:17.00 | NA | NA | NA | 3rd Qu.:4.000 | 3rd Qu.:4.000 | teacher : 62 | teacher : 31 | NA | NA | NA | NA | 3rd Qu.:2.000 | 3rd Qu.:2.000 | 3rd Qu.:0.0000 | NA | NA | NA | NA | NA | NA | 3rd Qu.:5.000 | 3rd Qu.:4.00 | 3rd Qu.:4.000 | 3rd Qu.:2.000 | 3rd Qu.:3.000 | 3rd Qu.:5.000 | 3rd Qu.: 6.0 | 3rd Qu.:14.00 | 3rd Qu.:14.00 | 3rd Qu.:14.00 | 3rd Qu.:2.500 | NA | |
| Max. :382.00 | NA | NA | Max. :22.00 | NA | NA | NA | Max. :4.000 | Max. :4.000 | NA | NA | NA | NA | NA | NA | Max. :4.000 | Max. :4.000 | Max. :3.0000 | NA | NA | NA | NA | NA | NA | Max. :5.000 | Max. :5.00 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :45.0 | Max. :18.00 | Max. :18.00 | Max. :18.00 | Max. :5.000 | NA |
# Graphs
p <- ggpairs(alc, columns = c("sex", "absences","goout", "G3", "high_use"), mapping = aes(col = sex, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
# initialize a barchart of alcohol use difference between genders
g1 <- ggplot(data = alc, aes(x=sex)) + geom_bar() + facet_wrap("high_use") + ggtitle("Student alcohol consumption by sex")
g1
# initialize a plot of alcohol use and going out with friends
g2 <- ggplot() + geom_boxplot(data = alc, aes(x = sex, y = goout)) + facet_wrap("high_use") + ggtitle("Student going out with friends by alcohol consumption and sex")
g2
# Boxplot of alcohol use and school final grade
g3 <- ggplot() + geom_boxplot(data = alc, aes(x = sex, y = G3)) + facet_wrap("high_use") + ggtitle("Student final grades by alcohol consumption and sex")
g3
# Scatterplot showing linear model between age and abscences separated by alcohol consumption
g4 <- ggplot(data = alc, aes(x = age, y = absences, color=sex, alpha = 0.3)) + geom_point() + facet_wrap("high_use")+ geom_jitter() + geom_smooth(method = "lm") + ggtitle("Student absences by alcohol consumption, sex and age")
g4
alc %>% group_by(sex) %>% summarise(count = n())
## # A tibble: 2 x 2
## sex count
## <fct> <int>
## 1 F 198
## 2 M 184
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_grade
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 156 11.4
## 2 F TRUE 42 11.7
## 3 M FALSE 112 12.2
## 4 M TRUE 72 10.3
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_absences
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 156 4.22
## 2 F TRUE 42 6.79
## 3 M FALSE 112 2.98
## 4 M TRUE 72 6.12
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_goout
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 156 2.96
## 2 F TRUE 42 3.36
## 3 M FALSE 112 2.71
## 4 M TRUE 72 3.93
# glm() model
m <- glm(high_use ~ absences + sex + goout, data = alc, family = "binomial")
# the summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + sex + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7871 -0.8153 -0.5446 0.8357 2.4740
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.16317 0.47506 -8.764 < 2e-16 ***
## absences 0.08418 0.02237 3.764 0.000167 ***
## sexM 0.95872 0.25459 3.766 0.000166 ***
## goout 0.72981 0.11970 6.097 1.08e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 387.75 on 378 degrees of freedom
## AIC: 395.75
##
## Number of Fisher Scoring iterations: 4
# the coefficients of the model
coef(m)
## (Intercept) absences sexM goout
## -4.16317087 0.08418399 0.95871896 0.72980859
# the odds ratio (OR)
OR <- coef(m) %>% exp
# the confidence interval (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
OR
## (Intercept) absences sexM goout
## 0.01555815 1.08782902 2.60835292 2.07468346
CI
## 2.5 % 97.5 %
## (Intercept) 0.005885392 0.03804621
## absences 1.042458467 1.13933894
## sexM 1.593132148 4.33151387
## goout 1.650182481 2.64111050
# printing out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.01555815 0.005885392 0.03804621
## absences 1.08782902 1.042458467 1.13933894
## sexM 2.60835292 1.593132148 4.33151387
## goout 2.07468346 1.650182481 2.64111050
From the model summary we can see that all three explanatory variables (absences, male sex, and going out) are statistically highly significant. Calculating the odds ratios and confidence intervals, we can see that absences, male sex, and going out are positively associated with high use, with confidence intervals staying above 1. When G3 was included in the model, its coefficient was negative and not significant, so I leave G3 out of the model used for the predictions below.
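As a reading aid (a minimal sketch, assuming the fitted model `m` above), a single logistic regression coefficient can be turned into an odds ratio directly:

```r
# Hedged sketch: exponentiating a logistic regression coefficient gives an odds ratio.
# For goout, each one-point increase multiplies the odds of high_use by exp(coef).
exp(coef(m)["goout"])  # ~2.07, matching the OR table above
```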
# probability of high_use
probabilities <- predict(m, type = "response")
# adding the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# using the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = (probability > 0.5))
# the last ten original classes, predicted probabilities, and class predictions
select(alc, goout, absences, sex, high_use, probability, prediction) %>% tail(10)
## goout absences sex high_use probability prediction
## 373 2 0 M FALSE 0.14869987 FALSE
## 374 3 7 M TRUE 0.39514446 FALSE
## 375 3 1 F FALSE 0.13129452 FALSE
## 376 3 6 F FALSE 0.18714923 FALSE
## 377 2 2 F FALSE 0.07342805 FALSE
## 378 4 2 F FALSE 0.25434555 FALSE
## 379 2 2 F FALSE 0.07342805 FALSE
## 380 1 3 F FALSE 0.03989428 FALSE
## 381 5 4 M TRUE 0.68596604 TRUE
## 382 1 2 M TRUE 0.09060457 FALSE
# target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 253 15
## TRUE 65 49
Among the last ten cases the true high_use values are TRUE for 3 cases and FALSE for 7. The predictions for the same cases are TRUE for 1 and FALSE for 9.
# a plot of 'high_use' versus 'probability' in 'alc'
pr <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
pr2 <- pr + geom_point() + ggtitle("Predictions")
pr2
Below, the target variable is cross-tabulated against the predictions and the training error of the model is computed with a loss function:
# the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>%prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66230366 0.03926702 0.70157068
## TRUE 0.17015707 0.12827225 0.29842932
## Sum 0.83246073 0.16753927 1.00000000
# a loss function
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2094241
It seems that the wrong-prediction proportion in this model, about 21%, is smaller than the 26% obtained in the DataCamp exercise.
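The error above is computed on the training data itself. For a genuine cross-validated estimate, one option (a hedged sketch using the boot package, in the spirit of the DataCamp exercise) is K-fold cross-validation with `cv.glm()`:

```r
# Hedged sketch: 10-fold cross-validation of the prediction error.
# Assumes the model `m`, the data `alc`, and `loss_func` defined above.
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]  # average proportion of wrong predictions across the folds
```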
# access packages
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2)
library(GGally)
library(tidyverse)
library(dplyr)
library(knitr)
library(corrplot)
## corrplot 0.84 loaded
library(tidyr)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# load the data
data("Boston")
This data frame contains the following columns:

- crim = per capita crime rate by town.
- zn = proportion of residential land zoned for lots over 25,000 sq.ft.
- indus = proportion of non-retail business acres per town.
- chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- nox = nitrogen oxides concentration (parts per 10 million).
- rm = average number of rooms per dwelling.
- age = proportion of owner-occupied units built prior to 1940.
- dis = weighted mean of distances to five Boston employment centres.
- rad = index of accessibility to radial highways.
- tax = full-value property-tax rate per $10,000.
- ptratio = pupil-teacher ratio by town.
- black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
- lstat = lower status of the population (percent).
- medv = median value of owner-occupied homes in $1000s.
In this week's exercise we use the Boston data set from the R MASS package, which contains historical data collected from 506 districts in the area around Boston.
Boston has 14 variables and 506 observations. The crime rate (crim) is the response variable.
The variables and their explanations are shown above.
#Dataset summary and variables
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
pairs(Boston)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
#Graphical summary of crime variable
ggplot(Boston, aes(crim)) + stat_density() + theme_bw()
#Plotting each variable against crime rate
bosmelt <- melt(Boston, id="crim")
ggplot(bosmelt, aes(x=value, y=crim))+
facet_wrap(~variable, scales="free")+
geom_point()
boxplot(Boston$crim, Boston$zn, Boston$indus, Boston$chas, Boston$nox, Boston$rm, Boston$age, Boston$dis, Boston$rad, Boston$tax, Boston$ptratio, Boston$black, Boston$lstat, Boston$medv, names = c("crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat", "medv"))
mlm <- lm(formula = crim ~ ., data = Boston)
summary(mlm)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.924 -2.120 -0.353 1.019 75.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.033228 7.234903 2.354 0.018949 *
## zn 0.044855 0.018734 2.394 0.017025 *
## indus -0.063855 0.083407 -0.766 0.444294
## chas -0.749134 1.180147 -0.635 0.525867
## nox -10.313535 5.275536 -1.955 0.051152 .
## rm 0.430131 0.612830 0.702 0.483089
## age 0.001452 0.017925 0.081 0.935488
## dis -0.987176 0.281817 -3.503 0.000502 ***
## rad 0.588209 0.088049 6.680 6.46e-11 ***
## tax -0.003780 0.005156 -0.733 0.463793
## ptratio -0.271081 0.186450 -1.454 0.146611
## black -0.007538 0.003673 -2.052 0.040702 *
## lstat 0.126211 0.075725 1.667 0.096208 .
## medv -0.198887 0.060516 -3.287 0.001087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4396
## F-statistic: 31.47 on 13 and 492 DF, p-value: < 2.2e-16
The most significant variables in the model are dis and rad (high significance), medv (moderate significance), and zn and black, which are less significant but still below p = 0.05.
cor_matrix<-cor(Boston)
cor_matrix %>% round(digits = 2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
corrplot.mixed(cor_matrix, number.cex = .6)
The corrplot shows the relationships between the variables. The highest positive correlations are between rad and tax, indus and nox, and age and nox. The highest negative correlations are between age and dis, lstat and medv, and dis and nox. We can see from the summaries that the scales of the variables differ greatly, so we need to standardize the dataset before doing linear discriminant analysis later.
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Standardizing centers and scales the data: from each continuous variable we subtract its mean and divide the result by the variable's standard deviation. After this, each variable has mean 0 and standard deviation 1.
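The same standardization can be written out by hand for a single column (a sketch; `scale()` does this for every column at once):

```r
# Hedged sketch: manual standardization of one variable.
crim_std <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
summary(crim_std)  # mean 0, matching the crim column of the scaled data
```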
# creating a quantile vector of crim
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
#removing crim
boston_scaled <- dplyr::select(boston_scaled, -crim)
#adding categorical variable to the table
boston_scaled <- data.frame(boston_scaled, crime)
For prediction we need a training set, which here is chosen to be 80% of the cases. The remaining 20% is used as a test set to assess the accuracy of the model.
n <- nrow(boston_scaled)
#Choosing 80% to the training set
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
# creating the test set
test <- boston_scaled[-ind,]
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2599010 0.2252475 0.2500000 0.2648515
##
## Group means:
## zn indus chas nox rm
## low 0.89912056 -0.8570457 -0.08484810 -0.8596159 0.431575407
## med_low -0.05305028 -0.3461630 -0.05600488 -0.5976699 -0.101719055
## med_high -0.38365582 0.2012935 0.19544522 0.4303616 0.006974305
## high -0.48724019 1.0170108 -0.08835242 1.0620794 -0.464230693
## age dis rad tax ptratio
## low -0.8585189 0.7392015 -0.6832698 -0.7467001 -0.45104413
## med_low -0.3350773 0.4316354 -0.5489875 -0.4877512 -0.08402393
## med_high 0.3863600 -0.3600747 -0.4099119 -0.2896182 -0.26163461
## high 0.8257121 -0.8686921 1.6392096 1.5148289 0.78203563
## black lstat medv
## low 0.38468514 -0.75240467 0.53107331
## med_low 0.31538347 -0.16205421 0.01650618
## med_high 0.08704716 0.03677049 0.10171794
## high -0.84104816 0.90847803 -0.71620500
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08955590 0.64336783 -1.00506708
## indus -0.05522375 -0.20231081 0.25460518
## chas -0.09660283 -0.04070963 0.04332074
## nox 0.36638439 -0.85532817 -1.19613326
## rm -0.10816053 -0.07753404 -0.12069285
## age 0.24969515 -0.30115351 0.13354400
## dis -0.09930625 -0.30957524 0.59826457
## rad 3.14491919 1.00776866 -0.18742432
## tax 0.08295319 -0.07652879 0.69360956
## ptratio 0.10923113 -0.03283484 -0.27277266
## black -0.13462280 0.03461104 0.08613158
## lstat 0.22463139 -0.23505020 0.38159522
## medv 0.17914980 -0.43591027 -0.06716101
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9549 0.0345 0.0106
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results and add the biplot arrows
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
# saving the correct classes from test data
correct_classes <- test$crime
# removing the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 17 3 2 0
## med_low 4 20 11 0
## med_high 1 4 19 1
## high 0 0 0 20
From the cross table we can see that the high class is predicted very well, but more errors occur in the lower classes.
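The cross-tabulation can be condensed into a single accuracy number (a small sketch, assuming `correct_classes` and `lda.pred` from the chunks above):

```r
# Hedged sketch: overall accuracy of the LDA predictions on the test set.
tab <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(tab)) / sum(tab)  # proportion of correctly classified districts
```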
# Boston dataset reading and standardization again
data("Boston")
b_boston_scaled <- scale(Boston)
# Distances with euclidean distance
dist_eu <- dist(b_boston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
km <- kmeans(b_boston_scaled, centers = 3)
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(b_boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
The optimal number of clusters is at the point where the total within-cluster sum of squares (WCSS) drops sharply. Here that elbow seems to be at two clusters.
# Clustering again
km2 <- kmeans(b_boston_scaled, centers = 2)
pairs(b_boston_scaled[,1:7], col = km2$cluster)
pairs(b_boston_scaled[,8:14], col = km2$cluster)
km3 <- kmeans(Boston, centers = 3)
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
km4 <- kmeans(Boston, centers = 2)
pairs(Boston[,1:7], col = km4$cluster)
pairs(Boston[,8:14], col = km4$cluster)
bins <- quantile(Boston$crim)
crime2 <- cut(Boston$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
table(crime2)
## crime2
## low med_low med_high high
## 127 126 126 127
Boston <- dplyr::select(Boston, -crim)
Boston <- data.frame(Boston, crime2)
u <- nrow(Boston)
ind2 <- sample(u, size = u * 0.8)
train2 <- Boston[ind2,]
test2 <- Boston[-ind2,]
lda.fit2 <- lda(crime2 ~ ., data = train2)
lda.fit2
## Call:
## lda(crime2 ~ ., data = train2)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2500000 0.2351485 0.2623762
##
## Group means:
## zn indus chas nox rm age
## low 33.348039 5.173333 0.03921569 0.4534382 6.584784 43.77549
## med_low 8.094059 9.092871 0.04950495 0.4921772 6.197436 59.68119
## med_high 2.547368 12.335158 0.15789474 0.6040526 6.402337 81.93684
## high 0.000000 18.100000 0.05660377 0.6798208 6.005179 91.18208
## dis rad tax ptratio black lstat medv
## low 5.657954 3.490196 282.8333 17.54412 390.9058 7.175784 27.00098
## med_low 4.467285 4.742574 321.8812 18.31683 385.1081 11.669109 22.51485
## med_high 2.911674 5.736842 349.4211 17.51789 364.4000 12.678211 24.52947
## high 2.027506 24.000000 666.0000 20.20000 278.7887 18.729623 15.95849
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.0059594595 0.0291392963 -0.043654152
## indus 0.0116837992 -0.0194328871 0.028512310
## chas -0.4398527204 -0.4347835204 -0.196120764
## nox 1.9916327959 -5.6218782002 -10.346653879
## rm -0.1892416869 -0.1153374711 -0.261021540
## age 0.0080682351 -0.0142044048 -0.007142634
## dis -0.0553948278 -0.0682297733 0.054422317
## rad 0.4673782227 0.1037951245 -0.004063668
## tax 0.0004496804 -0.0003602709 0.002638328
## ptratio 0.0678345162 0.0524262621 -0.082236100
## black -0.0015035340 0.0003945387 0.001896470
## lstat 0.0294406326 -0.0448026121 0.058101833
## medv 0.0234646334 -0.0462138320 -0.009722294
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9645 0.0275 0.0081
lda.arrows2 <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes2 <- as.numeric(train2$crime2)
# plot the lda results
plot(lda.fit2, dimen = 2, col = classes2, pch = classes2)
lda.arrows2(lda.fit2, myscale = 2)
## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped
Nox seems to be the most influential linear separator in the analysis without standardization.
I could not get the colors right, but otherwise the plot looks good.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# color by the crime classes of the training data (one per row of matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = train$crime)
# refit k-means on the training rows so each plotted point gets a cluster label
km_train <- kmeans(model_predictors, centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = as.factor(km_train$cluster))
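As a sanity check on the matrix multiplication above: projecting the (centered) predictors onto `lda.fit$scaling` should reproduce the LD scores that `predict()` returns. A sketch on the built-in iris data, using the same MASS package as the analysis above:

```r
library(MASS)

# Fit LDA and project the predictors manually onto the discriminants
fit <- lda(Species ~ ., data = iris)
X <- as.matrix(iris[, 1:4])
# predict.lda centers the data before projecting; with balanced classes
# that center equals the overall column means
scores_manual <- scale(X, center = TRUE, scale = FALSE) %*% fit$scaling
scores_pred   <- predict(fit)$x
max(abs(scores_manual - scores_pred))   # essentially zero
```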
This week we use the human dataset from the United Nations Development Programme. From this dataset we have selected eight variables (countries are stored as row names):
Edu2.FM - ratio of females to males with at least secondary education
Labo.FM - ratio of female to male labour-force participation rates
Edu.Exp - expected years of education
Life.Exp - life expectancy at birth
GNI - gross national income per capita
Mat.Mor - maternal mortality ratio
Ado.Birth - adolescent birth rate
Parli.F - percentage of female representatives in parliament
human <- read.csv("data/human.csv", row.names = 1)
library(GGally)
library(tidyverse)  # loads dplyr, tidyr and ggplot2, among others
library(corrplot)
library(knitr)
library(reshape2)
library(plotly)
library(FactoMineR)
str(human)
## 'data.frame': 148 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1980 Min. :0.1857 Min. : 7.00 Min. :49.00
## 1st Qu.:0.8097 1st Qu.:0.5971 1st Qu.:11.57 1st Qu.:68.38
## Median :0.9444 Median :0.7504 Median :13.55 Median :74.35
## Mean :0.8766 Mean :0.7006 Mean :13.42 Mean :72.44
## 3rd Qu.:0.9990 3rd Qu.:0.8385 3rd Qu.:15.32 3rd Qu.:77.45
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 680 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 5190 1st Qu.: 11.0 1st Qu.: 12.18 1st Qu.:12.15
## Median : 12408 Median : 45.5 Median : 31.30 Median :19.50
## Mean : 18402 Mean :120.9 Mean : 43.72 Mean :20.95
## 3rd Qu.: 25350 3rd Qu.:170.0 3rd Qu.: 69.05 3rd Qu.:27.82
## Max. :123124 Max. :730.0 Max. :175.60 Max. :57.50
pairs <- ggpairs(human)
pairs
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot()
From the summary we can see that both the distributions and the scales differ greatly between variables: GNI ranges from 680 to 123124, whereas Labo.FM ranges from 0.1857 to 1.0380. The pairwise correlations also vary widely, from strongly positive (maternal mortality vs. adolescent birth rate) to strongly negative (maternal mortality vs. life expectancy).
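The strongest correlations can also be pulled out programmatically instead of reading them off the corrplot. A sketch on the built-in mtcars data; the same approach works on human:

```r
# Sketch: find the most positive and most negative correlation pairs.
cm <- cor(mtcars)
cm[upper.tri(cm, diag = TRUE)] <- NA   # keep each variable pair only once
idx_max <- which(cm == max(cm, na.rm = TRUE), arr.ind = TRUE)
idx_min <- which(cm == min(cm, na.rm = TRUE), arr.ind = TRUE)
c(rownames(cm)[idx_max[1]], colnames(cm)[idx_max[2]])  # most positive pair
c(rownames(cm)[idx_min[1]], colnames(cm)[idx_min[2]])  # most negative pair
```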
#PCA analysis
pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.862483e+04 1.420218e+02 2.116337e+01 1.147634e+01 3.608478e+00
## [6] 1.571358e+00 1.923557e-01 1.512244e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4
## Edu2.FM -4.649714e-06 0.0006867686 -0.0013662633 2.317587e-04
## Labo.FM -9.719607e-08 -0.0003158778 0.0003015743 4.425582e-03
## Edu.Exp -8.706739e-05 0.0087580112 0.0053006118 3.255666e-02
## Life.Exp -2.533543e-04 0.0333850374 -0.0048336882 7.494569e-02
## GNI -9.999893e-01 -0.0046050756 -0.0003728181 -6.050839e-05
## Mat.Mor 4.477844e-03 -0.9863064242 0.1609608588 1.124047e-02
## Ado.Birth 1.111169e-03 -0.1611663247 -0.9864335638 -3.008031e-02
## Parli.F -5.580863e-05 0.0034657989 -0.0314144646 9.961287e-01
## PC5 PC6 PC7 PC8
## Edu2.FM -0.0055155087 1.762118e-02 6.384062e-01 7.694765e-01
## Labo.FM 0.0052146757 3.217052e-02 7.688482e-01 -6.385848e-01
## Edu.Exp 0.1415081241 9.886674e-01 -3.580796e-02 8.073940e-03
## Life.Exp 0.9861585472 -1.437809e-01 4.420684e-03 6.632638e-03
## GNI -0.0001023982 -2.864828e-05 -1.086468e-06 -1.752803e-06
## Mat.Mor 0.0337070459 2.675068e-03 1.432260e-04 1.224251e-03
## Ado.Birth 0.0039566221 7.122449e-03 -7.522696e-04 -1.109192e-03
## Parli.F -0.0791032655 -2.145731e-02 -2.750969e-03 1.847856e-03
#Summary of principal component analysis
s <- summary(pca_human)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)
# creating an object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# drawing a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink"), xlab = pc_lab[1], ylab = pc_lab[2], main = "PCA (non-scaled)")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
On the unscaled data PCA is not useful: GNI has by far the largest standard deviation, so the first principal component captures essentially 100% of the variance and simply reproduces GNI. Therefore we must scale the data to get a valid analysis.
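The effect can be demonstrated on simulated data: when one variable's variance dwarfs the others, the first component is essentially that variable alone. A sketch (the variable names are made up; "huge" stands in for GNI):

```r
# Sketch: PC1's variance share before and after scaling, when one
# variable has a much larger variance than the other.
set.seed(42)
d <- data.frame(small = rnorm(100, sd = 1),
                huge  = rnorm(100, sd = 10000))
share_raw    <- summary(prcomp(d))$importance[2, 1]
share_scaled <- summary(prcomp(d, scale. = TRUE))$importance[2, 1]
share_raw     # essentially 1: "huge" alone drives PC1
share_scaled  # close to 0.5 once both variables are comparable
```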
# standardize the variables
human_std <- scale(human)
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-3.1089 Min. :-2.6159 Min. :-2.42684 Min. :-3.0723
## 1st Qu.:-0.3065 1st Qu.:-0.5256 1st Qu.:-0.69805 1st Qu.:-0.5329
## Median : 0.3108 Median : 0.2531 Median : 0.04826 Median : 0.2503
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.5610 3rd Qu.: 0.7005 3rd Qu.: 0.71899 3rd Qu.: 0.6566
## Max. : 2.8414 Max. : 1.7144 Max. : 2.56114 Max. : 1.4495
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9515 Min. :-0.7355 Min. :-1.1573 Min. :-1.8197
## 1st Qu.:-0.7094 1st Qu.:-0.6742 1st Qu.:-0.8466 1st Qu.:-0.7643
## Median :-0.3218 Median :-0.4626 Median :-0.3333 Median :-0.1258
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3730 3rd Qu.: 0.3009 3rd Qu.: 0.6799 3rd Qu.: 0.5973
## Max. : 5.6228 Max. : 3.7352 Max. : 3.5397 Max. : 3.1749
#PCA analysis with scaled variables
pca_human_std <- prcomp(human_std)
pca_human_std
## Standard deviations (1, .., p=8):
## [1] 2.0333042 1.1429035 0.8966375 0.7974445 0.7016064 0.5533825 0.4782085
## [8] 0.3039769
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.32465414 0.09227181 -0.43377536 0.71962839 -0.3646631
## Labo.FM 0.02176326 0.72972518 -0.51145278 -0.19826256 0.3349423
## Edu.Exp -0.42571462 0.14257522 -0.03704407 -0.14368290 0.1144547
## Life.Exp -0.44696256 -0.02108739 0.13444430 -0.07252202 0.1544074
## GNI -0.36191020 0.03284391 -0.10502738 -0.55841503 -0.6927543
## Mat.Mor 0.44847613 0.16196165 -0.07472686 -0.20980104 -0.2594996
## Ado.Birth 0.41583795 0.14603035 -0.06162895 0.13254030 -0.3781145
## Parli.F -0.08992529 0.62416308 0.71441898 0.20859522 -0.1663542
## PC6 PC7 PC8
## Edu2.FM -0.102011694 0.06004801 0.18184654
## Labo.FM -0.113322727 -0.17601955 -0.10061935
## Edu.Exp 0.611088764 0.62425612 0.01404528
## Life.Exp 0.296297409 -0.63847349 0.50711229
## GNI -0.176468208 -0.08563262 -0.16340648
## Mat.Mor 0.004512949 0.24190007 0.77276191
## Ado.Birth 0.683004537 -0.31602881 -0.27395047
## Parli.F -0.133691243 0.04841963 -0.02315017
# rounded percentages of variance captured by each PC
s_st <- summary(pca_human_std)
pca_pr_st <- round(100*s_st$importance[2,], digits = 1)
# creating an object pc_lab to be used as axis labels
pc_lab_st <- paste0(names(pca_pr_st), " (", pca_pr_st, "%)")
# drawing a biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 0.8), col = c("grey40", "blue"), xlab = pc_lab_st[1], ylab = pc_lab_st[2], main = "PCA (scaled)")
With scaling, the analysis gives a much more reliable result. Rwanda seems to be an outlier in this two-dimensional biplot. PC1 and PC2 together account for about 68% of the variance, which is quite a lot.
The arrows form two groups with similar effects: the gender-inequality variables (labour-force participation and parliament representation) and the welfare variables (education, income, life expectancy and maternal health). Judging by the small angles within each group, the variables in a group correlate strongly with each other.
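The angle intuition can be checked numerically: for standardized data, the cosine of the angle between two variables' PC1/PC2 loading arrows approximates their correlation, and the approximation is good when the first two components capture most of those variables' variance. A sketch on mtcars; the same check could be run on pca_human_std:

```r
# Sketch: compare the biplot angle between two arrows with the actual
# correlation of the underlying variables.
p <- prcomp(mtcars, scale. = TRUE)
# arrow coordinates: loadings scaled by the component standard deviations
v <- p$rotation[, 1:2] * rep(p$sdev[1:2], each = nrow(p$rotation))
cos_angle <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
approx_cor <- cos_angle(v["cyl", ], v["disp", ])
true_cor   <- cor(mtcars$cyl, mtcars$disp)
c(approx = approx_cor, true = true_cor)  # both strongly positive
```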
data(tea)
# What is tea all about
dim(tea)
## [1] 300 36
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
The tea dataset includes 300 observations on 36 variables describing tea-drinking habits. Most of the variables are factors, many of them with only two levels. The dataset is too large for a reasonable analysis here, so I have picked the variables related to health attributes.
# column names to keep in the dataset
keep_columns <- c("Tea", "sugar", "pub", "friends", "sport", "healthy", "effect.on.health", "slimming", "relaxing")
# selecting the 'keep_columns' to create a new dataset
tea_health <- select(tea, one_of(keep_columns))
## Warning: Unknown columns: `sport`
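The warning above is caused by the lower-case "sport": the column in the tea data is called "Sport", so the misspelled name is silently dropped. Requested names can be checked against the data before selecting; a toy sketch:

```r
# Sketch: compare requested column names against the available ones.
available <- c("Tea", "sugar", "Sport", "healthy")   # as in the tea data
requested <- c("Tea", "sugar", "sport", "healthy")   # note the typo
setdiff(requested, available)    # names that get dropped: "sport"
intersect(requested, available)  # names that are actually selected
```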
# look at the summaries and structure of the tea_health data
summary(tea_health)
## Tea sugar pub friends
## black : 74 No.sugar:155 Not.pub:237 friends :196
## Earl Grey:193 sugar :145 pub : 63 Not.friends:104
## green : 33
## healthy effect.on.health slimming
## healthy :210 effect on health : 66 No.slimming:255
## Not.healthy: 90 No.effect on health:234 slimming : 45
##
## relaxing
## No.relaxing:113
## relaxing :187
##
str(tea_health)
## 'data.frame': 300 obs. of 8 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
gather(tea_health) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
pairs <- ggpairs(tea_health)
pairs
Most of the tea drinkers drink Earl Grey, and most drink tea with friends. 210 respondents think tea is healthy, but only 66 think tea has an effect on health. The majority (187) find tea relaxing. Sugar is used more often with Earl Grey than with the other teas.
# multiple correspondence analysis
mca <- MCA(tea_health, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_health, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.177 0.175 0.153 0.129 0.123 0.112
## % of var. 15.749 15.527 13.567 11.437 10.899 9.959
## Cumulative % of var. 15.749 31.275 44.842 56.279 67.178 77.137
## Dim.7 Dim.8 Dim.9
## Variance 0.099 0.087 0.071
## % of var. 8.804 7.731 6.329
## Cumulative % of var. 85.940 93.671 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.309 0.179 0.087 | 0.395 0.298 0.142 | -0.111 0.027
## 2 | -0.530 0.528 0.259 | 0.622 0.739 0.357 | 0.121 0.032
## 3 | -0.202 0.076 0.086 | -0.251 0.120 0.133 | -0.036 0.003
## 4 | -0.129 0.031 0.025 | -0.184 0.065 0.052 | -0.346 0.261
## 5 | 0.133 0.033 0.020 | 0.269 0.138 0.082 | -0.136 0.040
## 6 | -0.350 0.230 0.191 | 0.043 0.003 0.003 | -0.114 0.028
## 7 | -0.202 0.076 0.086 | -0.251 0.120 0.133 | -0.036 0.003
## 8 | -0.610 0.700 0.390 | 0.323 0.200 0.110 | 0.221 0.106
## 9 | 0.133 0.033 0.020 | 0.269 0.138 0.082 | -0.136 0.040
## 10 | -0.610 0.700 0.390 | 0.323 0.200 0.110 | 0.221 0.106
## cos2
## 1 0.011 |
## 2 0.014 |
## 3 0.003 |
## 4 0.181 |
## 5 0.021 |
## 6 0.020 |
## 7 0.003 |
## 8 0.051 |
## 9 0.021 |
## 10 0.051 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | -0.512 4.559 0.086 -5.064 | 0.560 5.528 0.103
## Earl Grey | 0.363 5.991 0.238 8.438 | -0.379 6.599 0.259
## green | -0.977 7.410 0.118 -5.940 | 0.959 7.243 0.114
## No.sugar | -0.360 4.719 0.138 -6.432 | 0.367 4.981 0.144
## sugar | 0.385 5.044 0.138 6.432 | -0.392 5.325 0.144
## Not.pub | -0.041 0.092 0.006 -1.366 | 0.267 4.033 0.268
## pub | 0.153 0.348 0.006 1.366 | -1.005 15.170 0.268
## friends | 0.173 1.383 0.057 4.112 | -0.340 5.416 0.218
## Not.friends | -0.326 2.607 0.057 -4.112 | 0.641 10.207 0.218
## healthy | -0.488 11.773 0.556 -12.896 | -0.227 2.584 0.120
## v.test Dim.3 ctr cos2 v.test
## black 5.537 | 0.936 17.710 0.287 9.264 |
## Earl Grey -8.792 | -0.108 0.613 0.021 -2.506 |
## green 5.831 | -1.469 19.430 0.267 -8.928 |
## No.sugar 6.562 | 0.351 5.200 0.131 6.267 |
## sugar -6.562 | -0.375 5.559 0.131 -6.267 |
## Not.pub 8.957 | -0.092 0.546 0.032 -3.080 |
## pub -8.957 | 0.345 2.053 0.032 3.080 |
## friends -8.079 | 0.085 0.382 0.013 2.006 |
## Not.friends 8.079 | -0.159 0.720 0.013 -2.006 |
## healthy -5.999 | 0.021 0.025 0.001 0.549 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.255 0.271 0.461 |
## sugar | 0.138 0.144 0.131 |
## pub | 0.006 0.268 0.032 |
## friends | 0.057 0.218 0.013 |
## healthy | 0.556 0.120 0.001 |
## effect.on.health | 0.345 0.131 0.181 |
## slimming | 0.043 0.010 0.379 |
## relaxing | 0.017 0.235 0.023 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
From this analysis we can see that the first two dimensions explain 31.275% of the model's variance. The social aspects (pub, friends) load mostly on Dim 2 and the physical-health aspects on Dim 1. Judging by the minor clusters, people who drink tea with friends tend to use sugar and find tea relaxing, whereas people who drink tea alone tend to drink black tea without sugar.
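Under the hood, MCA works on the complete indicator (0/1) expansion of the factors, one column per category. A toy base-R sketch of that expansion (the variable names echo the tea data, but the values are made up):

```r
# Sketch: build the full indicator matrix for two factors.
toy <- data.frame(sugar = factor(c("sugar", "No.sugar", "sugar")),
                  pub   = factor(c("pub", "Not.pub", "Not.pub")))
# contrasts = FALSE keeps every level instead of dropping a reference level
ind <- model.matrix(~ . - 1, data = toy,
                    contrasts.arg = lapply(toy, contrasts, contrasts = FALSE))
ind           # one 0/1 column per category
rowSums(ind)  # each row sums to the number of variables (here 2)
```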